We want to study the effect of three factors on the transcriptome of Arabidopsis roots. Here we aim to use and compare different multi-factor DE packages on our data.
Design, contrasts, and DE testing. What changes compared with a single condition?
edgeR and DESeq2 fit a negative binomial generalized log-linear model to the read counts for each gene, and conduct genewise statistical tests for a given coefficient or coefficient contrast. This model has the form
\(\log \mu_{gi} = x_i^T \beta_g + \log N_i\), knowing that the counts \(y_{gi} \sim NB(\mu_{gi}, \phi_g)\).
GLMs are an extension of classical linear models to non-normally distributed response data. GLMs specify probability distributions according to their mean-variance relationship, for example the quadratic mean-variance relationship specified above for read counts. Assuming that an estimate is available for \(\phi_g\), so that the variance can be evaluated for any value of \(\mu_{gi}\), GLM theory can be used to fit a log-linear model \(\log \mu_{gi} = x_i^T \beta_g + \log N_i\) for each gene [22]. Here \(x_i\) is a vector of covariates that specifies the treatment conditions applied to RNA sample \(i\), and \(\beta_g\) is a vector of regression coefficients by which the covariate effects are mediated for gene \(g\). The quadratic variance function specifies the negative binomial GLM distributional family. The use of the negative binomial distribution is equivalent to treating the \(\pi_{gi}\) as gamma distributed.
edgeR uses the Cox-Reid profile-adjusted likelihood (CR) method in estimating dispersions [22]. The CR method is derived to overcome the limitations of the qCML method as mentioned above. It takes care of multiple factors by fitting generalized linear models (GLMs) with a design matrix. The CR method is based on the idea of approximate conditional likelihood [?]. Given a table of counts or a DGEList object and the design matrix of the experiment, generalized linear models are fitted. This allows valid estimation of the dispersion, since all systematic sources of variation are accounted for.
https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
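As a sanity check on the quadratic mean-variance relationship quoted above, one can simulate negative binomial counts and compare the empirical variance with \(\mu + \phi\mu^2\). This sketch is not part of the original analysis; the dispersion value phi = 0.1 is an arbitrary illustration.

```r
# For NB counts with mean mu and dispersion phi, Var = mu + phi * mu^2.
# In R's parameterization, size = 1/phi yields exactly this variance.
set.seed(42)
mu <- 100
phi <- 0.1   # arbitrary dispersion, for illustration only
x <- rnbinom(1e5, mu = mu, size = 1 / phi)
c(empirical = var(x), theoretical = mu + phi * mu^2)
```

With 1e5 draws the two values agree closely, which is the mean-variance relationship the NB GLM encodes.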
# gene-level count table (genes in rows, samples in columns)
data <- read.csv("quantifFiles/quantifGenes.csv", header = TRUE, sep = ",")
rownames(data) <- data$Gene
# rows whose names contain "__" are quantification summary lines, not genes
genes <- which(!grepl("__", rownames(data)))
not_quant <- data[grepl("__", rownames(data)), ]
# keep only the count columns (sample names start with "R")
data <- data[genes, grepl("R", colnames(data))]
head(data)
R3 R2 R1 R5 R4 R6 R7 R11 R10 R12 R13 R17 R16 R14
AT1G01010 1526 1006 1116 1275 967 1018 854 1132 1294 1364 2325 2113 2193 2564
AT1G01020 416 285 289 349 364 370 226 513 502 561 461 407 432 614
AT1G01030 31 15 19 29 36 28 12 47 34 47 18 40 32 44
R15 R18 R24 R19 R22 R23 R21 R20 R9 R8
AT1G01010 2364 2074 1987 2027 1754 1697 1537 1898 816 912
AT1G01020 380 502 484 467 426 415 413 462 223 312
AT1G01030 37 27 42 39 36 40 37 37 15 19
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
dim(data)
[1] 28775 24
# sample annotation table (one row per sample)
annot <- read.csv("Code_for_RNAseq_CO2_N_Fr.csv", header = TRUE, sep = ";")
conditions <- as.vector(unique(annot$Sample))
annot$ID <- paste0("R", annot$Code)
# condition = sample name without its trailing replicate number
sampleNames <- as.character(annot$Sample)
annot$condition <- substr(sampleNames, 1, nchar(sampleNames) - 1)
cond <- unique(substr(conditions, 1, nchar(conditions) - 1))
# keep only the Arabidopsis (At) conditions
cond <- cond[grepl("At", cond)]
# condition (without replicate number) for a given sample ID
getCondition <- function(id) {
    return(annot[annot$ID == id, "condition"])
}
# full sample name, replicate number included
getExactCondition <- function(id) {
    return(annot[annot$ID == id, "Sample"])
}
# compact label: C/c = elevated/ambient CO2, N/n = high/low nitrate,
# F/f = Fe-supplied/Fe-starved, then "_" and the replicate number
getLabel <- function(id) {
    text <- as.vector(annot[annot$ID == id, "Sample"])
    res <- ""
    nb <- substr(text, nchar(text), nchar(text))
    if (grepl("Ambient", text)) {
        res <- paste0(res, "c")
    } else {
        res <- paste0(res, "C")
    }
    if (grepl("High", text)) {
        res <- paste0(res, "N")
    } else {
        res <- paste0(res, "n")
    }
    if (grepl("Starv", text)) {
        res <- paste0(res, "f")
    } else {
        res <- paste0(res, "F")
    }
    res <- paste0(res, "_", nb)
    return(res)
}
getLabel("R6")
[1] "cnF_3"
getExactCondition("R6")
[1] At_AmbientCO2_LowNitrate_Fe1
48 Levels: At_AmbientCO2_HighNitrate_Fe1 ... Sl_ElevatedCO2_LowNitrate_FeStarvation3
getCondition("R6")
[1] "At_AmbientCO2_LowNitrate_Fe"
TCC performs the test globally, with a "one way" layout that treats the 8 conditions as all different. By specifying a \(2*2*2\) factorial design instead, we can obtain coefficients tested against 0 for each of the variables and their interactions.
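The 2×2×2 factorial design can be written explicitly with model.matrix(). This is an illustrative sketch with made-up factor names (co2, nitrate, fe), not the ones used elsewhere in this report; with ~ co2 * nitrate * fe, each main effect, each pairwise interaction, and the three-way interaction get their own coefficient, each testable against 0.

```r
# hypothetical 2 x 2 x 2 factorial design, 3 replicates per condition (24 samples)
co2     <- factor(rep(c("ambient", "elevated"), each = 12))
nitrate <- factor(rep(rep(c("low", "high"), each = 6), 2))
fe      <- factor(rep(rep(c("starved", "supplied"), each = 3), 4))
# full factorial model: intercept + 3 main effects + 3 two-way + 1 three-way
design <- model.matrix(~ co2 * nitrate * fe)
colnames(design)
dim(design)   # 24 samples x 8 coefficients
```

Each column of this design matrix corresponds to one coefficient of the GLM, so a single fit answers questions about all three factors at once.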
glmLRT conducts likelihood ratio tests for one or more coefficients in the linear model. If coef is used, the null hypothesis is that all the coefficients indicated by coef are equal to zero. If contrast is non-null, then the null hypothesis is that the specified contrasts of the coefficients are equal to zero. For example, a contrast of c(0,1,-1), assuming there are three coefficients, would test the hypothesis that the second and third coefficients are equal.
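To make the c(0,1,-1) example concrete: limma's makeContrasts() (which edgeR workflows reuse) builds exactly such a vector from a symbolic expression. The coefficient names A, B, C here are invented for the illustration.

```r
library(limma)
# three hypothetical coefficients A, B, C; we want to test H0: B = C
con <- makeContrasts(B - C, levels = c("A", "B", "C"))
# con is the contrast vector (0, 1, -1) over (A, B, C);
# passing it as `contrast` to glmLRT() tests whether B and C are equal
con
```

A zero entry leaves a coefficient out of the test; equal and opposite entries compare two coefficients against each other rather than against 0.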
While the likelihood ratio test is a more obvious choice for inference with GLMs, the QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It provides more robust and reliable error rate control when the number of replicates is small. The QL dispersion estimation and hypothesis testing can be done by using the functions glmQLFit() and glmQLFTest().
library(edgeR)
library(stringr)
# data
d <- data
# relabel the samples with their compact condition labels (e.g. "cnF_3")
colnames(d) <- sapply(colnames(d), getLabel)
# group = condition label without the replicate number
groups <- str_split_fixed(colnames(d), "_", 2)[, 1]
y <- DGEList(counts = d, group = groups)
y$samples
group lib.size norm.factors
cNF_3 cNF 32033146 1
cNF_2 cNF 25708565 1
cNF_1 cNF 25389227 1
cnF_2 cnF 29410954 1
cnF_1 cnF 26720832 1
cnF_3 cnF 26429803 1
CNF_1 CNF 16944343 1
CnF_2 CnF 28518532 1
CnF_1 CnF 30134923 1
CnF_3 CnF 30939471 1
cNf_1 cNf 30097577 1
cnf_2 cnf 29411503 1
cnf_1 cnf 32417735 1
cNf_2 cNf 34497723 1
cNf_3 cNf 31947553 1
cnf_3 cnf 30167133 1
Cnf_3 Cnf 33780481 1
CNf_1 CNf 30427781 1
Cnf_1 Cnf 29577460 1
Cnf_2 Cnf 30405729 1
CNf_3 CNf 25049585 1
CNf_2 CNf 28496828 1
CNF_3 CNF 18823797 1
CNF_2 CNF 20700940 1
# filter out lowly expressed genes
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
# normalization (TMM factors)
y <- calcNormFactors(y)
# norm <- cpm(y, normalized.lib.sizes = TRUE)
# not_norm <- cpm(y, normalized.lib.sizes = FALSE)
# rd_genes <- sample(rownames(norm), 500)
# heatmap(not_norm[rd_genes, ], main = "Without normalization")
# heatmap(norm[rd_genes, ], main = "With normalization")
# definition of the design and of the right reference conditions
# design <- model.matrix(~0 + groups)
# con <- makeContrasts(groupscnF - groupscnf, levels = design)
# fit <- glmQLFit(y, design)
# qlf.2vs1 <- glmQLFTest(fit, contrast = con)
# a <- topTags(qlf.2vs1, n = 1000)
# a$table
# estimation of dispersion and tests
y <- edgeR::estimateDisp(y)
Design matrix not provided. Switch to the classic mode.
pval <- 0.01
# classic edgeR exact test (pairwise, no design matrix)
et <- exactTest(y)
best <- topTags(et, n = 20000)
# keep genes with FDR < 0.01 and |logFC| > 1
best <- best[best$table$FDR < pval, ]
best <- best[abs(best$table$logFC) > 1, ]
# genes <- rownames(cpm(y, normalized.lib.sizes=TRUE))
DEgenes <- rownames(best$table)
plotBCV(y)
We use mixture models to group genes with similar expression profiles across the different conditions.
AT5G46900 AT2G39040 AT4G16370 AT5G46890 AT1G51830 AT3G22910 AT1G03870 AT5G17760
    25044     18629    150324     23563     35374     20583     89198     69411
[ long printout of the DE-gene counts truncated -- remaining entries omitted ]
conds <- str_split_fixed(colnames(dataC), "_", 2)[, 1]
run_pois <- coseq(dataC, conds = conds, K = 8:12, model = "Poisson", iter = 5, transformation = "none")
****************************************
coseq analysis: Poisson approach & none transformation
K = 8 to 12
Use set.seed() prior to running coseq for reproducible results.
****************************************
Running g = 8 ...
[1] "Initialization: 1"
[1] "Log-like diff: 878.404180648169"
[1] "Log-like diff: 651.349925622305"
[1] "Log-like diff: 805.395049068972"
[1] "Log-like diff: 434.786231420049"
[1] "Log-like diff: 966.648418756102"
Running g = 9 ...
[1] "Initialization: 1"
[1] "Log-like diff: 0.582796121679099"
[1] "Log-like diff: 0.0321640837230284"
[1] "Log-like diff: 0.0137844118717094"
[1] "Log-like diff: 0.00252795050268162"
[1] "Log-like diff: 0.00051506274905222"
Running g = 10 ...
[1] "Initialization: 1"
[1] "Log-like diff: 683.054035391224"
[1] "Log-like diff: 271.024133213778"
[1] "Log-like diff: 581.694897322395"
[1] "Log-like diff: 656.650402574313"
[1] "Log-like diff: 1288.60256562109"
Running g = 11 ...
[1] "Initialization: 1"
[1] "Log-like diff: 908.727295904043"
[1] "Log-like diff: 314.87446940178"
[1] "Log-like diff: 1709.74488523059"
[1] "Log-like diff: 351.701362997625"
[1] "Log-like diff: 1193.43828507614"
Running g = 12 ...
[1] "Initialization: 1"
[1] "Log-like diff: 380.573834136091"
[1] "Log-like diff: 161.079292460031"
[1] "Log-like diff: 140.210806395253"
[1] "Log-like diff: 48.1565928423826"
[1] "Log-like diff: 87.7598594277771"
$logLike
$ICL
$profiles
$boxplots
$probapost_boxplots
$probapost_barplots
$probapost_histogram
*************************************************
Model: Poisson
Transformation: none
*************************************************
Clusters fit: 8,9,10,11,12
Clusters with errors: ---
Selected number of clusters via ICL: 12
ICL of selected model: -4056350
*************************************************
Number of clusters = 12
ICL = -4056350
*************************************************
Cluster sizes:
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
264 706 407 142 363 271 195
Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
634 603 174 73 321
Number of observations with MAP > 0.90 (% of total):
3971 (95.62%)
Number of observations with MAP > 0.90 per cluster (% of total per cluster):
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
258 684 378 137 348 258 179
(97.73%) (96.88%) (92.87%) (96.48%) (95.87%) (95.2%) (91.79%)
Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
615 574 169 69 302
(97%) (95.19%) (97.13%) (94.52%) (94.08%)
We display the clustering in the principal plane of a PCA.
suppressMessages(library(ade4, warn.conflicts = F, quietly = T))
suppressMessages(library(adegraphics, warn.conflicts = F, quietly = T))
# PCA on the log-transformed counts
acp <- dudi.pca(log(dataC + 0.1), center = TRUE, scale = TRUE, scannf = FALSE, nf = 4)
summary(acp)
Class: pca dudi
Call: dudi.pca(df = log(dataC + 0.1), center = TRUE, scale = TRUE,
scannf = FALSE, nf = 4)
Total inertia: 24
Eigenvalues:
Ax1 Ax2 Ax3 Ax4 Ax5
19.6234 3.1216 0.3283 0.1563 0.1041
Projected inertia (%):
Ax1 Ax2 Ax3 Ax4 Ax5
81.7643 13.0069 1.3679 0.6513 0.4337
Cumulative projected inertia (%):
Ax1 Ax1:2 Ax1:3 Ax1:4 Ax1:5
81.76 94.77 96.14 96.79 97.22
(Only 5 dimensions (out of 24) are shown)
library("RColorBrewer")
dataC$cluster <- clusters_per_genes[as.vector(rownames(dataC))]
# correlation circle of the samples in the first principal plane
s.corcircle(acp$co, xax = 1, yax = 2, fullcircle = FALSE, pback.col = "lightgrey")
# helper to plot the clusters on a given pair of axes
# (12 clusters were selected, hence 12 colours from the "Paired" palette)
plotClusters <- function(xax, yax) {
    adegraphics::s.class(acp$li, xax = xax, yax = yax, as.factor(dataC$cluster),
        labels = as.character(levels(as.factor(dataC$cluster))),
        col = brewer.pal(n = 12, name = "Paired"), chullSize = 1, ellipseSize = 0,
        plabels.cex = 0.7, pbackground.col = "grey85",
        main = "Clusters in the principal plane", ylim = c(-9, 9))
}
plotClusters(1, 2)
plotClusters(2, 3)
plotClusters(4, 2)
plotClusters(4, 3)
It seems that the PCA first captures (on the first principal axis) the mean expression level, then the differential expression induced by iron.
We can draw representations in the plane of the second and third principal axes (the first of these reflects iron, the second nitrate starvation).
On the correlation circle in the plane of axes 2 and 3, we can clearly see that once iron starvation is present, a nitrate effect can be distinguished. The CO2 effect is still difficult to identify, the 4th PCA axis being not very informative...